# Load libraries for EDA
library('tidyverse') # Data wrangling & visualization
library('gt') # Create professional tables
library('janitor') # Clean column names & messy data
library('psych') # Descriptive stats & psych research tools
library('stringr') # String manipulation
library('lubridate') # Handle dates & times
library('rmarkdown') # Render R Markdown docs
library('dplyr') # Data manipulation (filter, select, mutate, etc.)
library('skimr') # Quick summary stats
library('tidyr') # Reshape/tidy data
library('readxl') # Read Excel files
library('ggplot2') # Data visualization
library('readr') # Read/write CSV & text files
library('knitr') # Generate dynamic reports
library('leaflet') # Generate dynamic map
# Read files into session
transactions <- as.data.frame(data.table::fread("data/transactional_data.csv"))
customer_address <- read.csv("data/customer_address_and_zip_mapping.csv")
customer_profile <- read.csv("data/customer_profile.csv")
delivery_cost <- readxl::read_xlsx('data/delivery_cost_data.xlsx')
Exploratory Data Analysis | Swire Coca-Cola Capstone Project
IS 6813-001, Spring 2025 | Group 3
Business Problem Statement
Regional beverage bottler Swire Coca-Cola (SCCU) relies on two business models: 1) “Red Truck”, which features high-volume customers serviced personally by Swire, and 2) “White Truck” or “Alternate Route to Market”, which entails smaller customers serviced by a third-party distributor.
Swire’s current segmenting strategy has led to misallocation of resources, inflated expenses, and missed opportunities with clients that have high-growth potential. Swire aims to better align their customers with the business proposition of these models by identifying the customer characteristics and ordering behaviors that best determine the right business model for the long-term relationship.
EDA Introduction
In this exploratory data analysis (EDA) notebook, we assess the quality of the provided data, detail the data cleaning processes undertaken, define our success metrics, and outline the key questions driving our analysis. By examining customer characteristics and ordering behaviors, we aim to realign Swire’s customer segmentation with the strategic propositions of their two distribution models, ultimately enabling a more effective long-term relationship strategy.
File Investigation
Libraries & Data
We begin by setting up our session with the necessary libraries and data provided by Swire. These will be referenced often throughout this document.
Files from Swire
Swire Coca-Cola provided us with four (4) files we can use for the project. These include:
| Name | Description | Format | Columns | Rows |
|---|---|---|---|---|
| transactional_data | This dataset records detailed transactional information, including order quantities and delivery metrics. | .csv | 11 | 1,045,540 |
| customer_address_and_zip_mapping | This dataset maps ZIP codes to full address information. | .csv | 2 | 1,801 |
| customer_profile | This dataset provides detailed information about customers, including onboarding and purchasing behavior. | .csv | 11 | 30,478 |
| delivery_cost_data | This dataset describes the median 'per gallon/case' cost of delivery according to annual order volume and customer profile. | .xlsx | 5 | 160 |
Let’s take a peek at each of these in turn. We’ll remark on:
- Completeness (how many data are missing)
- Data types (are fields properly formatted for analysis)
- Distribution (how values are spread out)
- Potential wrangling (what we can do for improving data set quality)
transactional_data.csv
skim(transactions)
| Name | transactions |
| Number of rows | 1045540 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| TRANSACTION_DATE | 0 | 1 | 8 | 10 | 0 | 723 | 0 |
| ORDER_TYPE | 0 | 1 | 3 | 13 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| WEEK | 0 | 1 | 26.23 | 14.52 | 1.0 | 14 | 26 | 38.00 | 52.00 | ▇▇▇▇▇ |
| YEAR | 0 | 1 | 2023.50 | 0.50 | 2023.0 | 2023 | 2023 | 2024.00 | 2024.00 | ▇▁▁▁▇ |
| CUSTOMER_NUMBER | 0 | 1 | 546643776.32 | 49426585.56 | 500245678.0 | 501091920 | 501548213 | 600080939.00 | 600975408.00 | ▇▁▁▁▇ |
| ORDERED_CASES | 0 | 1 | 26.85 | 126.76 | 0.0 | 0 | 7 | 18.50 | 8479.89 | ▇▁▁▁▁ |
| LOADED_CASES | 0 | 1 | 25.92 | 122.79 | 0.0 | 0 | 7 | 18.00 | 8171.56 | ▇▁▁▁▁ |
| DELIVERED_CASES | 0 | 1 | 25.13 | 121.52 | -3132.0 | 0 | 6 | 17.33 | 8069.48 | ▁▇▁▁▁ |
| ORDERED_GALLONS | 0 | 1 | 9.87 | 26.47 | 0.0 | 0 | 0 | 12.50 | 2562.50 | ▇▁▁▁▁ |
| LOADED_GALLONS | 0 | 1 | 9.60 | 25.65 | 0.0 | 0 | 0 | 12.50 | 2562.50 | ▇▁▁▁▁ |
| DELIVERED_GALLONS | 0 | 1 | 9.21 | 25.18 | -1792.5 | 0 | 0 | 12.50 | 2292.50 | ▁▁▇▁▁ |
glimpse(transactions)
Rows: 1,045,540
Columns: 11
$ TRANSACTION_DATE <chr> "1/5/2023", "1/6/2023", "1/9/2023", "1/11/2023", "1/…
$ WEEK <int> 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 9…
$ YEAR <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
$ CUSTOMER_NUMBER <int> 501202893, 500264574, 501174701, 600586532, 50101432…
$ ORDER_TYPE <chr> "MYCOKE LEGACY", "MYCOKE LEGACY", "MYCOKE LEGACY", "…
$ ORDERED_CASES <dbl> 1.0, 12.5, 2.0, 18.0, 29.0, 1.5, 6.0, 0.0, 4.0, 18.0…
$ LOADED_CASES <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0, 17.0…
$ DELIVERED_CASES <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0, 17.0…
$ ORDERED_GALLONS <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…
$ LOADED_GALLONS <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…
$ DELIVERED_GALLONS <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…
Completeness
The data are fully complete; there’s not a single column with missing or empty values.
Data Types
We have a fair mix of data types, though some are improperly cast. Identifiers like CUSTOMER_NUMBER shouldn’t be treated as continuous numbers, and dates like TRANSACTION_DATE can be enriched with a proper date/datetime format.
Distribution
The data are quite skewed: nearly all of the volume columns are heavily right-skewed (meaning most transactions are small). We also see some “negative” delivery volumes, which we understand to be returns.
Potential Wrangling
This data set will benefit most from proper data types. Additionally, we can determine (for each customer) how often they see returned transactions (a potentially helpful indicator for client quality).
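As a quick sketch of that return indicator, the snippet below counts how often each customer shows a negative delivered volume. It uses a toy tibble standing in for `transactions` (column names are real; the values are hypothetical), so treat it as an illustration rather than the final implementation.

```r
library(dplyr)

# Toy stand-in for `transactions`; negative delivered volumes denote returns
toy_txn <- tibble::tibble(
  CUSTOMER_NUMBER   = c(1, 1, 2, 2, 2),
  DELIVERED_CASES   = c(10, -3, 5, 7, 0),
  DELIVERED_GALLONS = c(0, 0, 0, -2, 4)
)

returns_by_customer <- toy_txn |>
  group_by(CUSTOMER_NUMBER) |>
  summarise(
    # A transaction is a "return" if either volume column is negative
    n_returns   = sum(DELIVERED_CASES < 0 | DELIVERED_GALLONS < 0),
    return_rate = n_returns / n(),
    .groups = "drop"
  )
```

Here customer 1 returns on 1 of 2 transactions and customer 2 on 1 of 3, so the resulting rates are 0.5 and ~0.33.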
We don’t yet have a good sense for 1) the breadth of data longitudinally and 2) what order types exist. Let’s solve these with some basic visualizations.
transactions |>
mutate(
TRANSACTION_DATE = lubridate::mdy(TRANSACTION_DATE)
) |>
group_by(TRANSACTION_DATE) |>
count() |>
ungroup() |>
ggplot(
aes(TRANSACTION_DATE, n)
) +
geom_line(color = swire_colors$blue) +
geom_smooth(color = swire_colors$red) +
scale_x_date() +
labs(
title = "Daily Transaction Volume",
x = "Transaction Date"
) +
theme_swire() +
theme(
axis.title.y = element_blank()
)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There’s obviously high variability in the number of transactions per day, but the smoothed trend shows typical daily volumes between 1,000 and 2,000 transactions. Additionally, it’s clear we have two (2) years of transactions: 2023 & 2024.
order_types <-
transactions |>
group_by(ORDER_TYPE) |>
count() |>
ungroup() |>
mutate(perc = n / sum(n)) |>
arrange(n)
order_types$ORDER_TYPE = factor(order_types$ORDER_TYPE, levels = order_types$ORDER_TYPE)
ggplot(
order_types,
aes(n, ORDER_TYPE, label = scales::percent_format()(perc))
) +
geom_col(fill = swire_colors$red) +
geom_text(
aes(
hjust = ifelse(n < 1e5, -0.1, 1.1),
color = ifelse(n < 1e5, "black", "white"),
)
) +
scale_color_identity() +
labs(
title = "Distribution of `ORDER_TYPE`"
) +
theme_swire() +
theme(
axis.title = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
Okay, we now see there are six (6) unique order types plus some values that indicate null. That appears to be a text “NULL”, so it’s an opportunity to clean up. Three order types are clearly of largest interest:
- MYCOKE LEGACY
- CALL CENTER
- SALES REP
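One way to handle that textual null, sketched on a small example vector (the values here are illustrative, not pulled from the file):

```r
library(dplyr)

# Example ORDER_TYPE values; "NULL" is a literal string, not a true NA
order_type <- c("MYCOKE LEGACY", "NULL", "SALES REP", "CALL CENTER")

# Replace the text "NULL" with a proper NA so missing-value counts are honest
order_type_clean <- na_if(order_type, "NULL")
```

After cleaning, `sum(is.na(order_type_clean))` reports one genuine missing value, where `skim()` previously saw a fully complete column.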
customer_address_and_zip_mapping.csv
skim(customer_address)
| Name | customer_address |
| Number of rows | 1801 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| full.address | 0 | 1 | 45 | 73 | 0 | 1801 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zip | 0 | 1 | 28919.81 | 25588.64 | 1001 | 2153 | 21634 | 42440 | 71483 | ▇▃▅▁▅ |
glimpse(customer_address)
Rows: 1,801
Columns: 2
$ zip <int> 71018, 71021, 71023, 71024, 71039, 71055, 71058, 71071, 7…
$ full.address <chr> "71018,Cotton Valley,Louisiana,LA,Webster,119,32.819,-93.…
Completeness
We aren’t seeing any incomplete/missing values so far here. That’s great news.
Data Types
The data types aren’t quite right. Because of leading zeros and nine-digit ZIP codes, zip should really be of type “character”. Also, the full address is stored as a single comma-separated string; not ideal.
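A minimal base-R sketch of that character conversion, padding integer ZIPs back out to five digits (sample values drawn from the skim above):

```r
# ZIP codes read as integers lose their leading zeros; formatting them as
# zero-padded five-character strings restores the canonical form
zips_int <- c(1001L, 2153L, 71483L)
zips_chr <- sprintf("%05d", zips_int)
zips_chr
# "01001" "02153" "71483"
```

The tidyverse equivalent is `stringr::str_pad(zips_int, 5, "left", pad = "0")`; either way the result should be stored as character, never re-cast to numeric.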
Distribution
There’s no field where assessing distribution is helpful/relevant.
Potential Wrangling
Pulling out the values from full.address will be helpful. We can then make sure those are of proper data types.
One thing we don’t have a great sense for right now is where the addresses are located. Let’s see if we can’t make a dynamic map showing where customers are located:
Swire Market Geography
We’ll first pull out the latitude/longitude values.
addr_exp <-
customer_address |>
separate(
full.address,
into = c("zip", "city", "state", "state_abbr", "county", "region", "lat", "lon"),
sep = ",",
convert = TRUE
  )
Then we’ll add them to a map:
swire_map <-
leaflet(height = 800, width = 800) |>
addTiles() |>
setView(lng = mean(addr_exp$lon), lat = mean(addr_exp$lat), zoom = 5) |>
addProviderTiles("CartoDB.Positron") |>
addCircleMarkers(
data = addr_exp,
lng = ~lon,
lat = ~lat,
radius = 6,
color = swire_colors$red,
fillOpacity = 0.75
)
swire_map
Okay, we’re seeing customers largely in Kansas, Kentucky, Maryland, and Massachusetts, with some in Louisiana. It’s not yet clear to what extent, if any, this geography will hold value in a more optimized delivery model for Swire, but we’ll ensure the data are wrangled properly regardless.
customer_profile.csv
skim(customer_profile)
| Name | customer_profile |
| Number of rows | 30478 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| logical | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| FREQUENT_ORDER_TYPE | 0 | 1 | 3 | 13 | 0 | 6 | 0 |
| FIRST_DELIVERY_DATE | 0 | 1 | 8 | 10 | 0 | 2401 | 0 |
| ON_BOARDING_DATE | 0 | 1 | 8 | 10 | 0 | 6487 | 0 |
| COLD_DRINK_CHANNEL | 0 | 1 | 5 | 13 | 0 | 9 | 0 |
| TRADE_CHANNEL | 0 | 1 | 6 | 28 | 0 | 26 | 0 |
| SUB_TRADE_CHANNEL | 0 | 1 | 4 | 27 | 0 | 48 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| LOCAL_MARKET_PARTNER | 0 | 1 | 0.90 | TRU: 27355, FAL: 3123 |
| CO2_CUSTOMER | 0 | 1 | 0.39 | FAL: 18496, TRU: 11982 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CUSTOMER_NUMBER | 0 | 1.0 | 538301800.92 | 47950644.47 | 500245678 | 501164306 | 501573994 | 600075795 | 600975408 | ▇▁▁▁▅ |
| PRIMARY_GROUP_NUMBER | 18196 | 0.4 | 2779.85 | 2608.64 | 4 | 444 | 1892 | 4488 | 9999 | ▇▃▂▁▁ |
| ZIP_CODE | 0 | 1.0 | 30252.25 | 25953.08 | 1001 | 2155 | 21771 | 42762 | 71483 | ▇▃▅▁▆ |
glimpse(customer_profile)
Rows: 30,478
Columns: 11
$ CUSTOMER_NUMBER <int> 501556470, 501363456, 600075150, 500823056, 60008…
$ PRIMARY_GROUP_NUMBER <int> 376, NA, 2158, 2183, 1892, NA, 9996, NA, NA, 8803…
$ FREQUENT_ORDER_TYPE <chr> "MYCOKE LEGACY", "SALES REP", "SALES REP", "OTHER…
$ FIRST_DELIVERY_DATE <chr> "1/2/2024", "4/14/2022", "3/4/2016", "2/6/2019", …
$ ON_BOARDING_DATE <chr> "8/28/2023", "3/22/2022", "3/22/2012", "11/23/201…
$ COLD_DRINK_CHANNEL <chr> "DINING", "DINING", "DINING", "DINING", "PUBLIC S…
$ TRADE_CHANNEL <chr> "FAST CASUAL DINING", "COMPREHENSIVE DINING", "FA…
$ SUB_TRADE_CHANNEL <chr> "PIZZA FAST FOOD", "FSR - MISC", "OTHER FAST FOOD…
$ LOCAL_MARKET_PARTNER <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, …
$ CO2_CUSTOMER <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ ZIP_CODE <int> 21664, 1885, 67073, 1885, 1203, 67952, 40517, 156…
Completeness
All columns appear complete with the exception of PRIMARY_GROUP_NUMBER, where approximately 60% of records are missing. We interpret these as customers who are “stand-alone”, i.e., not part of a corporate franchising arm. Therefore, it’s understandable these values are missing.
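That share is easy to verify directly, since `mean()` of an `is.na()` mask gives the missing proportion. Sketched on a toy vector standing in for `PRIMARY_GROUP_NUMBER` (values hypothetical):

```r
# Toy stand-in for customer_profile$PRIMARY_GROUP_NUMBER;
# NA marks a stand-alone customer with no parent group
pgn <- c(376, NA, 2158, NA, NA)

# Proportion of stand-alone customers
share_standalone <- mean(is.na(pgn))
share_standalone
# 0.6
```

Run against the real column, this matches the 0.4 complete_rate that `skim()` reports.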
Data Types
Data types are largely appropriate. We have several “character” types we can turn into factors. Some dates should be appropriately cast. We do need to change identifiers to “character”. And similar processing as already described will be done for ZIP_CODE.
Distribution
There’s not a ton to note here except that the vast majority of customers are “local market partners” and most aren’t purchasing CO2 from Swire, either because they source elsewhere or are not fountain customers.
Potential Wrangling
It’s likely that a customer belonging to a franchise would be evaluated differently than a stand alone customer, all else being equal. We can derive a value for the number of franchises belonging to the primary group.
Let’s see what values/frequency is inherent to COLD_DRINK_CHANNEL:
channel <-
customer_profile |>
group_by(COLD_DRINK_CHANNEL) |>
count() |>
ungroup() |>
mutate(perc = n / sum(n)) |>
arrange(n)
channel$COLD_DRINK_CHANNEL = factor(channel$COLD_DRINK_CHANNEL, levels = channel$COLD_DRINK_CHANNEL)
ggplot(
channel,
aes(n, COLD_DRINK_CHANNEL, label = scales::percent_format()(perc))
) +
geom_col(fill = swire_colors$red) +
geom_text(
aes(
hjust = ifelse(n < 10000, -0.1, 1.1),
color = ifelse(n < 10000, "black", "white"),
)
) +
scale_color_identity() +
labs(
title = "Distribution of `COLD_DRINK_CHANNEL`"
) +
theme_swire() +
theme(
axis.title = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
There are nine (9) different channels, largely dominated by “DINING”. Others of interest include “EVENT” and “GOODS”.
delivery_cost_data.xlsx
skim(delivery_cost)
| Name | delivery_cost |
| Number of rows | 160 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Cold Drink Channel | 0 | 1 | 5 | 13 | 0 | 8 | 0 |
| Vol Range | 0 | 1 | 5 | 11 | 0 | 10 | 0 |
| Applicable To | 0 | 1 | 8 | 16 | 0 | 2 | 0 |
| Cost Type | 0 | 1 | 8 | 10 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Median Delivery Cost | 0 | 1 | 2.6 | 1.71 | 0.37 | 1.33 | 2.24 | 3.47 | 8.59 | ▇▆▂▁▁ |
glimpse(delivery_cost)
Rows: 160
Columns: 5
$ `Cold Drink Channel` <chr> "WORKPLACE", "WORKPLACE", "WORKPLACE", "WORKPLA…
$ `Vol Range` <chr> "0 - 149", "150 - 299", "300 - 449", "450 - 599…
$ `Applicable To` <chr> "Bottles and Cans", "Bottles and Cans", "Bottle…
$ `Median Delivery Cost` <dbl> 8.0649504, 4.1656458, 2.9915579, 2.5242219, 2.0…
$ `Cost Type` <chr> "Per Case", "Per Case", "Per Case", "Per Case",…
Completeness
All values are present; there shouldn’t be any imputation necessary.
Data Types
Vol Range is a field that needs to be split out into “min”/“max” values. Everything else seems very reasonable.
Distribution
In most cases, the distribution of Median Delivery Cost is skewed right, which is indicative of outlier costs for troublesome clients.
Potential Wrangling
It’s notable that the column names are formatted differently from the other files; this is an area for standardization. Additionally, we’ll need to do quite a bit of prep to get the appropriate annual volumes by customer and intersect those with this lookup table.
delivery_cost$`Cold Drink Channel` = factor(delivery_cost$`Cold Drink Channel`, levels = channel$COLD_DRINK_CHANNEL)
ggplot(
delivery_cost,
aes(`Median Delivery Cost`, `Cold Drink Channel`)
) +
geom_boxplot(color = swire_colors$red) +
scale_x_continuous(labels = scales::label_currency()) +
facet_wrap(~`Applicable To`, nrow = 1, scales = "free") +
labs(
title = "Distribution of Median Delivery Cost",
subtitle = "By `Cold Drink Channel` and `Product Type`"
) +
theme_swire() +
theme(
axis.title = element_blank()
)
Interestingly, there’s real variance among costs in some Cold Drink Channels but less in others. Additionally, the pattern isn’t consistent between “Bottles and Cans” and “Fountain” products.
Data Wrangling
Having explored the files individually and made note of important details, we can now move on to wrangling these data into a single object suitable for the more in-depth analysis upcoming. This wrangling will comprise:
- Prepping individual files to the appropriate level of detail
- Combining the four (4) individual files we collected
- Standardizing column names
- Developing a theory for ordering said columns
- Casting fields to their proper data types
- Deriving columns that further describe customer profiles
- Settling on transaction-level data with appropriate cost estimates
Prepping the Cost Data
delivery_cost_expanded <-
delivery_cost |>
# Split the volume range into an object
mutate(
range_obj = purrr::map(`Vol Range`, str_split, " - ")
) |>
# Unnest the object for individual reference
unnest(range_obj) |>
unnest_wider(range_obj, names_sep = "_") |>
# Handle the "1350+" scenario
mutate(
min_vol = purrr::map_chr(range_obj_1, str_replace, "\\+", ""),
max_vol = ifelse(is.na(range_obj_2), (2^31) - 1, range_obj_2)
) |>
  # Turn volumes from characters to integers
mutate(
across(min_vol:max_vol, as.integer)
) |>
# Drop irrelevant columns
  select(-c(range_obj_1, range_obj_2, `Vol Range`))
annual_cust_volume <-
# Take transaction level data
transactions |>
# Bring in the customer profile for the `Cold Drink Channel`
inner_join(
customer_profile,
join_by(CUSTOMER_NUMBER)
) |>
# Get annual cases/gallons by customer
group_by(YEAR, CUSTOMER_NUMBER, COLD_DRINK_CHANNEL) |>
summarise(
annual_cases = sum(DELIVERED_CASES),
annual_gallons = sum(DELIVERED_GALLONS),
.groups = "drop"
  )
delivery_cost_tiers <-
annual_cust_volume |>
left_join(
delivery_cost_expanded |> filter(`Applicable To` != 'Fountain'),
join_by(COLD_DRINK_CHANNEL == `Cold Drink Channel`, annual_cases >= min_vol, annual_cases <= max_vol)
) |>
left_join(
delivery_cost_expanded |> filter(`Applicable To` == 'Fountain'),
join_by(COLD_DRINK_CHANNEL == `Cold Drink Channel`, annual_gallons >= min_vol, annual_gallons <= max_vol),
suffix = c(".c", ".g")
) |>
select(
YEAR, CUSTOMER_NUMBER,
case_delivery_cost = `Median Delivery Cost.c`,
gallon_delivery_cost = `Median Delivery Cost.g`
)
# Take a peek
head(delivery_cost_tiers)
# A tibble: 6 × 4
YEAR CUSTOMER_NUMBER case_delivery_cost gallon_delivery_cost
<int> <int> <dbl> <dbl>
1 2023 500245678 4.47 2.68
2 2023 500245685 8.59 2.01
3 2023 500245686 7.33 4.62
4 2023 500245687 5.59 3.76
5 2023 500245689 8.59 2.53
6 2023 500245690 8.59 3.02
Prep the Customer Address Object
cust_addr_expanded <-
customer_address |>
# Split the full address into an object
mutate(
addr_obj = purrr::map(full.address, str_split, ",")
) |>
# Unnest the object for individual reference
unnest(addr_obj) |>
unnest_wider(addr_obj, names_sep = "_") |>
# Pad the zip code with leading zeros and make a character
mutate(
zip = str_pad(zip, 5, "left", pad = "0")
) |>
# Rename columns
rename(
city = addr_obj_2,
state = addr_obj_3,
state_abbr = addr_obj_4,
county = addr_obj_5,
lat = addr_obj_7,
lon = addr_obj_8
) |>
# Turn lat/lon values to numeric
mutate(
across(lat:lon, as.numeric)
) |>
# Drop irrelevant columns
  select(-c(full.address, addr_obj_1, addr_obj_6))
Combine Individual Files
combined_data_raw <-
# Take transactions
transactions |>
# Join the customer profile data thereto
inner_join(
customer_profile |> mutate(zip = str_pad(
ZIP_CODE, 5, "left", "0"
)),
join_by(CUSTOMER_NUMBER)
) |>
# Join the customer address data thereto
inner_join(
cust_addr_expanded,
join_by(zip)
) |>
# Join the delivery cost tiers data thereto
inner_join(
delivery_cost_tiers,
join_by(YEAR, CUSTOMER_NUMBER)
  )
Standardize Names & Data Types
combined_data_std <-
# Take the combined data from above
combined_data_raw |>
# Standardize the names
clean_names() |>
# Standardize data types
mutate(
    # Convert character dates to date types
across(c(transaction_date, first_delivery_date, on_boarding_date), lubridate::mdy),
# Turn IDs into characters
across(c(customer_number, primary_group_number), as.character),
# Turn finite categorical fields into factors
across(
c(order_type, cold_drink_channel, frequent_order_type, trade_channel, sub_trade_channel, state, state_abbr),
as.factor
)
) |>
# Remove irrelevant columns
  select(-c(zip_code))
Enrich Dataset with New Fields
swire_data_full <-
combined_data_std |>
# Add new fields
mutate(
# Calculate delivered gallons cost
# Assume a return is only half as costly as a normal delivery
delivered_gallons_cost = case_when(
delivered_gallons < 0 ~ -1 * delivered_gallons * gallon_delivery_cost / 2,
TRUE ~ delivered_gallons * gallon_delivery_cost
),
# Calculate delivered case cost
# Assume a return is only half as costly as a normal delivery
delivered_cases_cost = case_when(
delivered_cases < 0 ~ -1 * delivered_cases * case_delivery_cost / 2,
TRUE ~ delivered_cases * case_delivery_cost
),
# Create 'total' columns representing the sum of gallons & cases
ordered_total = ordered_gallons + ordered_cases,
loaded_total = loaded_gallons + loaded_cases,
delivered_total = delivered_gallons + delivered_cases,
) |>
group_by(year, primary_group_number) |>
mutate(
# Calculate number of customers belonging to each primary group by year
primary_group_customers = ifelse(is.na(primary_group_number), 0, n_distinct(customer_number))
) |>
group_by(year, customer_number) |>
mutate(
# Calculate how often a customer issues a return each year
return_frequency = sum(ifelse(delivered_cases < 0 | delivered_gallons < 0, 1, 0))
) |>
ungroup() |>
# Drop select columns that are no longer relevant
select(-c(gallon_delivery_cost, case_delivery_cost)) |>
# Order the columns logically
select(
# CUSTOMER PROFILE ITEMS
customer_number, primary_group_number, primary_group_customers,
on_boarding_date, first_delivery_date, cold_drink_channel, frequent_order_type, trade_channel, sub_trade_channel, local_market_partner, co2_customer, city, zip, state, state_abbr, county, lat, lon,
# TRANSACTION DETAILS
transaction_date, week, year, order_type,
ordered_cases, loaded_cases, delivered_cases, delivered_cases_cost,
ordered_gallons, loaded_gallons, delivered_gallons, delivered_gallons_cost,
ordered_total, loaded_total, delivered_total,
return_frequency
  )
Final Data Set
glimpse(swire_data_full)
Rows: 1,045,540
Columns: 34
$ customer_number <chr> "501202893", "500264574", "501174701", "600586…
$ primary_group_number <chr> NA, "1894", NA, "8397", "1993", "437", NA, NA,…
$ primary_group_customers <dbl> 0, 47, 0, 49, 78, 651, 0, 0, 0, 0, 28, 0, 0, 8…
$ on_boarding_date <date> 2021-04-02, 2015-12-08, 2021-01-26, 1997-02-2…
$ first_delivery_date <date> 2021-05-07, 2018-03-23, 2021-04-12, 2017-05-0…
$ cold_drink_channel <fct> DINING, WELLNESS, DINING, BULK TRADE, GOODS, G…
$ frequent_order_type <fct> SALES REP, OTHER, MYCOKE360, SALES REP, SALES …
$ trade_channel <fct> COMPREHENSIVE DINING, HEALTHCARE, FAST CASUAL …
$ sub_trade_channel <fct> FSR - MISC, OTHER HEALTHCARE, OTHER FAST FOOD,…
$ local_market_partner <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRU…
$ co2_customer <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
$ city <chr> "Mahaska", "Banner", "Chelmsford", "Bernardsto…
$ zip <chr> "66955", "41603", "01824", "01337", "67473", "…
$ state <fct> Kansas, Kentucky, Massachusetts, Massachusetts…
$ state_abbr <fct> KS, KY, MA, MA, KS, KS, MD, MA, KY, KS, MD, KY…
$ county <chr> "Washington", "Floyd", "Middlesex", "Franklin"…
$ lat <dbl> 39.9845, 37.5707, 42.5911, 42.6952, 39.4194, 3…
$ lon <dbl> -97.3453, -82.6806, -71.3556, -72.5772, -98.69…
$ transaction_date <date> 2023-01-05, 2023-01-06, 2023-01-09, 2023-01-1…
$ week <int> 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7…
$ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
$ order_type <fct> MYCOKE LEGACY, MYCOKE LEGACY, MYCOKE LEGACY, S…
$ ordered_cases <dbl> 1.0, 12.5, 2.0, 18.0, 29.0, 1.5, 6.0, 0.0, 4.0…
$ loaded_cases <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0…
$ delivered_cases <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0…
$ delivered_cases_cost <dbl> 8.585482, 56.436259, 17.170965, 49.293328, 134…
$ ordered_gallons <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ loaded_gallons <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ delivered_gallons <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ delivered_gallons_cost <dbl> 129.7426990, 0.0000000, 0.0000000, 7.1199352, …
$ ordered_total <dbl> 91.000000, 12.500000, 2.000000, 20.500000, 29.…
$ loaded_total <dbl> 91.000000, 12.500000, 2.000000, 18.500000, 29.…
$ delivered_total <dbl> 91.000000, 12.500000, 2.000000, 18.500000, 29.…
$ return_frequency <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
Now, we have a single, standardized data set that is enriched, properly formatted, and well-suited to the remaining analysis.
Data Exploration
Fulfillment and Order Frequency
In the table below, we calculate the fulfillment rate for a selection of customers. The fulfillment rate is defined as total delivered cases divided by total ordered cases.
- A fulfillment rate of 1 means the customer receives exactly what they order.
- A fulfillment rate greater than 1 indicates they are being delivered more than they ordered.
It may be worth investigating whether these customers have high return rates or if they consistently receive more than they request. If it’s the latter, this pattern could be an indicator of high sales demand.
# Calculate fulfillment rates by customer
fulfillment_rates <- swire_data_full %>%
group_by(customer_number) %>%
summarise(
total_ordered_cases = sum(ordered_cases, na.rm = TRUE),
total_delivered_cases = sum(delivered_cases, na.rm = TRUE),
fulfillment_rate = ifelse(total_ordered_cases == 0, NA, total_delivered_cases / total_ordered_cases)
) %>%
filter(!is.na(fulfillment_rate)) %>% # Remove rows where fulfillment rate is undefined
arrange(desc(fulfillment_rate))
# Define the customer numbers to keep for table
selected_customers <- c("501261447", "501563326", "501580838", "501686677", "501483916", "600563579",
"501004189", "501004936", "501006776", "501006936", "501007325", "501007477", "501007757")
# Summarizing the fulfillment rates and filtering the selected customers for table
filtered_fulfillment_rates <- swire_data_full %>%
group_by(customer_number) %>%
summarise(
total_ordered_cases = sum(ordered_cases, na.rm = TRUE),
total_delivered_cases = sum(delivered_cases, na.rm = TRUE),
fulfillment_rate = ifelse(total_ordered_cases == 0, NA, total_delivered_cases / total_ordered_cases)
) %>%
filter(customer_number %in% selected_customers) %>% # Keep only the selected customers
arrange(desc(fulfillment_rate))
# Display the filtered table
gt(filtered_fulfillment_rates) |>
tab_header(
title = "Fulfillment Rates for Selected Customers"
  )
| Fulfillment Rates for Selected Customers | |||
|---|---|---|---|
| customer_number | total_ordered_cases | total_delivered_cases | fulfillment_rate |
| 501261447 | 3.0 | 18.0 | 6.0000000 |
| 501563326 | 1.0 | 6.0 | 6.0000000 |
| 501580838 | 2.0 | 12.0 | 6.0000000 |
| 501686677 | 1.0 | 6.0 | 6.0000000 |
| 600563579 | 159.0 | 915.0 | 5.7547170 |
| 501483916 | 43.0 | 221.5 | 5.1511628 |
| 501004189 | 50.0 | 50.0 | 1.0000000 |
| 501004936 | 10.5 | 10.5 | 1.0000000 |
| 501006776 | 801.0 | 801.0 | 1.0000000 |
| 501006936 | 52.0 | 52.0 | 1.0000000 |
| 501007325 | 20.0 | 20.0 | 1.0000000 |
| 501007757 | 76.5 | 76.5 | 1.0000000 |
| 501007477 | 113.0 | 109.0 | 0.9646018 |
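To probe whether over-fulfillment travels with returns, the sketch below flags customers whose fulfillment rate exceeds 1 and checks whether they also log any returns. It uses toy rows standing in for `swire_data_full` (real column names, hypothetical values), so it's an illustration of the check rather than the final analysis.

```r
library(dplyr)

# Toy rows standing in for `swire_data_full`; values are hypothetical
toy <- tibble::tibble(
  customer_number  = c("A", "A", "B", "B"),
  ordered_cases    = c(1, 2, 10, 5),
  delivered_cases  = c(6, 12, 10, 5),
  return_frequency = c(0, 0, 1, 1)
)

over_fulfilled <- toy |>
  group_by(customer_number) |>
  summarise(
    fulfillment_rate = sum(delivered_cases) / sum(ordered_cases),
    has_returns      = any(return_frequency > 0),
    .groups = "drop"
  ) |>
  filter(fulfillment_rate > 1)
```

In this toy example, only customer "A" is over-fulfilled (rate 6) and shows no returns, which would point toward high sales demand rather than return churn.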
The plot below indicates that the Travel, Superstore, and Bulk Trade channels have the highest average total ordered, followed by General Activities and Academic Institutions. Other trade channels, including Healthcare, Recreation, and Defense, have lower averages. This comparison highlights that specific channels tend to have a very high average total ordered while others are consistently lower.
# Aggregate data to calculate the average total ordered per trade channel
top_trade_channels <- swire_data_full %>%
group_by(trade_channel) %>%
# summarise(AVG_TOTAL_ORDERED = mean(ordered_cases + ordered_gallons, na.rm = TRUE)) %>%
summarise(AVG_TOTAL_ORDERED = mean(ordered_total, na.rm = TRUE)) %>%
arrange(desc(AVG_TOTAL_ORDERED)) %>%
  slice_head(n = 20) # Select top 20 trade channels
# Create bar plot
ggplot(top_trade_channels, aes(x = reorder(trade_channel, -AVG_TOTAL_ORDERED), y = AVG_TOTAL_ORDERED, fill
= trade_channel)) +
geom_bar(stat = "identity", alpha = 0.7, show.legend = FALSE) + # Bar plot without legend
labs(
    title = "Top 20 Trade Channels by Average Total Ordered",
x = "Trade Channel",
y = "Average Total Ordered"
) +
theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) # Rotate x-axis labels for readability
The graph below illustrates that Cruise, Online Store, and Recreation Park have the highest average total ordered volumes. Additionally, Bulk Trade, Other Travel, and Comprehensive Provider exhibit notable order volumes, further reinforcing their contribution to overall sales.
On the other hand, other sub-trade channels, such as Game Center, Recreation Film, and Fast Food, show comparatively lower average total ordered values. This comparison highlights the significant variation in demand across sub-trade channels, suggesting that order volume trends are strongly influenced by industry-specific purchasing behaviors.
# Aggregate data to calculate the average total ordered per sub-trade channel
top_10_sub_trade <- swire_data_full %>%
group_by(sub_trade_channel) %>%
  summarise(AVG_TOTAL_ORDERED = mean(ordered_cases + ordered_gallons, na.rm = TRUE)) %>%
  arrange(desc(AVG_TOTAL_ORDERED)) %>%
  slice_head(n = 20) # Select top 20 sub-trade channels
ggplot(top_10_sub_trade, aes(x = reorder(sub_trade_channel, -AVG_TOTAL_ORDERED), y = AVG_TOTAL_ORDERED, fill = sub_trade_channel)) +
  geom_bar(stat = "identity",
           alpha = 0.7,
           show.legend = FALSE) + # Bar plot without legend
  labs(title = "Top 20 Sub-Trade Channels by Average Total Ordered", x = "Sub Trade Channel", y = "Average Total Ordered") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) # Rotate x-axis labels for readability
Order Type
The graph below indicates that higher order volumes primarily come from MyCoke Legacy. This trend could be attributed to the fact that most sales originate from MyCoke Legacy and Sales Rep orders, or it may suggest a potential link between order type and high-growth potential customers.
When analyzing customers with < 400 volume, Call Center is the most common order type, while Sales Rep orders drop to third place. This shift suggests that order type could be a valuable characteristic for identifying high-growth potential customers, particularly among lower-volume segments. Further analysis could help determine whether order type patterns correlate with customer expansion trends.
# Aggregate order data by order type
order_type_summary <- swire_data_full %>%
group_by(order_type) %>%
summarise(total_units = sum(ordered_cases + ordered_gallons, na.rm = TRUE))
# Plot total ordered volume by order type
ggplot(order_type_summary, aes(x = reorder(order_type, -total_units), y = total_units, fill = order_type)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Total Ordered Volume by Order Type", x = "Order Type", y = "Total Units (Cases + Gallons)")
# Plot total ordered volume by order type for customers with < 400 gallons per year
swire_data_full %>%
group_by(customer_number, order_type) %>%
summarize(total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE), .groups = "drop") %>%
filter(total_gallons < 400) %>%
group_by(order_type) %>%
summarize(gallons = sum(total_gallons)) %>%
ggplot(aes(x = reorder(order_type, -gallons), y = gallons, fill = order_type)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Ordered Volume by Order Type (Customers < 400 Gallons)", x = "Order Type", y = "Total Units (Cases + Gallons)")
Cold Drink Channel
The graphs below indicate that Dining and Goods are the primary cold drink channels for customers with a volume of less than 400 gallons, whereas Bulk Trade and Dining dominate across all customers. This distinction suggests that conducting a separate ‘Dining’ cold drink channel analysis could be valuable in comparing the characteristics and ordering behaviors between customers below and above the 400-gallon threshold.
# for all customers
swire_data_full %>%
group_by(cold_drink_channel) %>%
summarize(gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(cold_drink_channel, - gallons), y = gallons, fill = cold_drink_channel)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Ordered Volume by Cold Drink Channel") +
labs(x = "Cold Drink Channel", y = "Total Units (Cases + Gallons)")
# for customers with < 400 gallons per year
swire_data_full %>%
group_by(customer_number, cold_drink_channel) %>%
summarize(total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE), .groups = "drop") %>%
filter(total_gallons < 400) %>%
group_by(cold_drink_channel) %>%
summarize(gallons = sum(total_gallons)) %>%
ggplot(aes(reorder(cold_drink_channel, - gallons), y = gallons, fill = cold_drink_channel)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Ordered Volume by Cold Drink Channel (Customers < 400 Gallons)") +
labs(x = "Cold Drink Channel", y = "Total Units (Cases + Gallons)")
Local Market Partners
The table below presents order volume and unit delta from 2023 to 2024 by customer, segmented by trade channel and sub-trade channel. Among the sub-trade channels, Cruises, Online Stores, Bulk Trade, and Comprehensive Providers exhibit higher unit delta year-over-year, indicating strong growth trends.
Additionally, the Fast Casual Dining trade channel stands out with notable year-over-year increases in order volume. This suggests that Fast Casual Dining customers within the local market partner segment could represent less obvious but high-growth potential opportunities, making them a strategic group to explore further.
# year over year changes in gallon volume by customer
swire_data_full %>%
filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
group_by(customer_number, year, trade_channel, sub_trade_channel) %>%
summarize(gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE)) %>%
pivot_wider(names_from = year, values_from = gallons) %>%
mutate(unit_delta = `2024` - `2023`) %>%
arrange(desc(unit_delta))
# A tibble: 16,746 × 6
# Groups: customer_number, trade_channel [16,746]
customer_number trade_channel sub_trade_channel `2023` `2024` unit_delta
<chr> <fct> <fct> <dbl> <dbl> <dbl>
1 500733119 TRAVEL CRUISE 139520 1.69e5 29102
2 600064039 TRAVEL CRUISE 160752 1.88e5 27734.
3 600573998 TRAVEL CRUISE 376588 4.03e5 26077
4 500405618 BULK TRADE BULK TRADE 588 2.60e4 25422
5 501031108 SUPERSTORE ONLINE STORE 38909 6.12e4 22288.
6 600261951 TRAVEL CRUISE 106644 1.27e5 20502
7 600585734 TRAVEL CRUISE 84808 1.03e5 17747
8 600576028 SUPERSTORE ONLINE STORE 29744 4.65e4 16738.
9 501202275 PROFESSIONAL SERV… OTHER PROFESSION… 216 1.67e4 16494
10 501562278 GENERAL COMPREHENSIVE PR… 5858 2.22e4 16343
# ℹ 16,736 more rows
The graphs below illustrate the distribution of order types among local market partners. Notably, EDI is the dominant order type among local market partners. However, among customers with a volume of less than 400 gallons, Call Center and MyCoke360/Legacy orders remain the preferred choices.
While this difference may not seem highly significant, the shift in order type preference for customers with < 400 gallons—both in the local market partner segment and across all customers—could indicate that this variable holds some importance in customer behavior.
# for all local market partner customers
swire_data_full %>%
filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
group_by(order_type) %>%
summarize(gallons = sum(ordered_gallons + ordered_cases)) %>%
ggplot(aes(x = reorder(order_type, -gallons), y = gallons, fill = order_type)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Ordered Volume by Order Type (Local Market Partners)", x = "Order Type", y = "Total Units (Cases + Gallons)")
# for customers with < 400 gallons per year
swire_data_full %>%
filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
group_by(customer_number, order_type) %>%
summarize(total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE), .groups = "drop") %>%
filter(total_gallons < 400) %>%
group_by(order_type) %>%
summarize(gallons = sum(total_gallons)) %>%
ggplot(aes(x = reorder(order_type, -gallons), y = gallons, fill = order_type)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Ordered Volume by Order Type (Local Market Partners < 400 Gallons)", x = "Order Type", y = "Total Units (Cases + Gallons)")
The visuals below indicate that non-partners generally exhibit higher median total orders and a wider range of order volumes, with some extreme outliers. In contrast, local market partners tend to have lower total orders and a smaller interquartile range, suggesting less variability in their order quantities.
This difference in distribution may reflect distinct ordering behaviors between the two groups, warranting further analysis to understand the factors influencing order volume consistency among local market partners.
# Define threshold for filtering (optional)
NUMBER <- 1000 # Exclude extreme values if needed
# Filter dataset to exclude outliers (optional)
ORDERED_UNDER_NUMBER <- swire_data_full %>%
filter(ordered_gallons + ordered_cases <= NUMBER)
# Create boxplot to compare TOTAL_ORDERED by LOCAL_MARKET_PARTNER status
ggplot(ORDERED_UNDER_NUMBER, aes(x = local_market_partner, y = ordered_gallons + ordered_cases, fill = local_market_partner)) +
geom_boxplot(alpha = 0.6) + # Boxplot with transparency
labs(
title = paste("LOCAL_MARKET_PARTNER vs. TOTAL_ORDERED (Excluding Orders >", NUMBER, ")"),
x = "Local Market Partner Status",
y = "Total Ordered"
) +
theme_minimal()
# Compute average TOTAL_ORDERED for Local Market Partners (TRUE) and Non-Partners (FALSE)
avg_total_order <- swire_data_full %>%
group_by(local_market_partner) %>%
summarise(AVG_TOTAL_ORDERED = mean(ordered_gallons + ordered_cases, na.rm = TRUE))
# Display result
print(avg_total_order)
# A tibble: 2 × 2
local_market_partner AVG_TOTAL_ORDERED
<lgl> <dbl>
1 FALSE 57.6
2 TRUE 32.2
CO2 Customer
CO2 customers are an interesting subset of the data. We can look at customers who purchase fountain products (i.e., “ordered_gallons”) and compare their distribution across CO2 customer status, and even adjust for local market partner status.
swire_data_full |>
group_by(co2_customer, customer_number, local_market_partner) |>
summarise(total = sum(ordered_gallons)) |>
ungroup() |>
filter(total > 0) |>
ggplot(
aes(x = local_market_partner, total)
) +
geom_boxplot(color = swire_colors$red) +
scale_y_log10() +
facet_wrap(~co2_customer) +
labs(
title = "Distribution of Total Volume Among 'Fountain' Products",
subtitle = "Grouped by status as a CO2 customer",
x = "Local Market Partner?"
) +
theme_swire() +
theme(
axis.title.y = element_blank()
)
Among those ordering fountain products, we really aren’t seeing any measurable difference. There are subtleties in the outliers, but non-CO2 customers aren’t meaningfully different in order volume. Even adjusting for local_market_partner doesn’t reveal any meaningful pattern.
Customer Tenure
An interesting component of this analysis is the varying levels of tenure, where some customers are “new” (their first order falls within the time window) while others have been customers for years. We can look at order volume by tenure (years) for any patterns.
swire_data_full |>
group_by(customer_number) |>
summarise(
tenure_years = floor(max(as.integer(transaction_date - on_boarding_date) / 365)),
total = sum(ordered_total)
) |>
ungroup() |>
mutate(tenure_years = as.factor(tenure_years)) |>
ggplot(
aes(tenure_years, total)
) +
geom_boxplot(color = swire_colors$red) +
scale_x_discrete() +
scale_y_log10() +
labs(
title = "Distribution of Total Volume",
subtitle = "Organized by customer tenure (years)"
) +
theme_swire()
Predictably, customers with tenure shorter than our two (2) year window will have ordered less than others. There also appear to be many outliers during the initial years of tenure; later on, however, there’s more variability in the IQR. We can expand this even more by introducing another variable and simplifying to tenure bins.
swire_data_full |>
group_by(customer_number, order_type) |>
summarise(
tenure_years = floor(max(as.integer(transaction_date - on_boarding_date) / 365)),
total = sum(ordered_total)
) |>
ungroup() |>
mutate(tenure_years_bin = as.factor(cut(tenure_years, breaks = 4))) |>
ggplot(
aes(tenure_years_bin, total)
) +
geom_boxplot(color = swire_colors$red) +
scale_y_log10() +
facet_wrap(~order_type) +
labs(
title = "Distribution of Total Volume",
subtitle = "Organized by tenure and order type"
) +
theme_swire()
We see here that there really isn’t any pattern of time/tenure playing into the equation. Looking at the same breakdown for cold_drink_channel, the same bears true. This suggests there’s no general trend of time being a factor in increased order volume, even down to subgroups.
Geographic Order Volume
We can also get a sense of the concentration of orders by geography. We’ll build off of the map we created above, but scale the size of the points to the volume of orders.
swire_cust_vol <-
swire_data_full |>
group_by(customer_number, lon, lat, local_market_partner) |>
summarise(total = sum(ordered_total)) |>
ungroup() |>
mutate(
total_scaled = (total / max(total)) * 100
)
swire_map_vol <-
leaflet(height = 800, width = 800) |>
addTiles() |>
setView(lng = mean(swire_cust_vol$lon), lat = mean(swire_cust_vol$lat), zoom = 5) |>
addProviderTiles("CartoDB.Positron") |>
addCircleMarkers(
data = swire_cust_vol,
lng = ~lon,
lat = ~lat,
radius = ~sqrt(total_scaled), # sqrt so marker area, not radius, scales with volume
color = ~swire_colors$red,
fillOpacity = 0.15
)
swire_map_vol
There are certainly hotbeds of more activity than others, but there isn’t a strong rhyme-or-reason pattern noticeable from this view. Better still would be quantifying density relative to other delivery points and/or to sister franchises. We could use something like the haversine formula as a proxy.
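For reference, the haversine formula gives the great-circle distance d between two points from their latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (in radians), with r as Earth’s radius (roughly 6371 km):

$$
d = 2r \arcsin\left( \sqrt{ \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right)
$$

In R, the geosphere package’s distHaversine() offers one ready implementation, so a nearest-neighbor distance per customer could be derived directly from the lon/lat columns already in swire_cust_vol.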
Conclusion
This exploratory data analysis (EDA) exercise has established a foundation for our group in some key areas:
We are far more familiar with Swire Coca-Cola’s business model, data structure, and customer nuances. This will equip us with necessary context for the modeling stage.
We have a solid understanding of areas to model and explore more deeply given the lessons learned and findings extracted. This means we’ll be more efficient and effective.
Our team has yielded “proof of concept” in how we collaborate cognitively (i.e. exploration of ideas) and structurally. This is crucial as the group looks to deliver a powerful solution to Swire Coca-Cola by project end.
During the next phase of the project, Modeling, our team will build on these insights with more advanced analysis techniques, such as clustering algorithms, cost-savings models, and both temporal and spatial analyses. This is all in the effort to develop a framework for better distinguishing high-growth potential from low-growth customers, improving on the current 400-gallon threshold.
Division of Work
Formulation of this group EDA was made possible thanks to the collaborative partnership from all group members.
As a group, we’ve partnered in the following ways:
Worked out of a group GitHub repository, following best practices of commits to branches and issuing pull requests, with teammates reviewing, approving, and merging work to main.
Partnered on data wrangling logic, determining how we would clean, standardize, and resolve inherent constraints of the provided data. This was the recipe for the individual work described below.
Consolidated individual work into this document, ensuring thoughtful organization, engaging content, and attractive design.
Each member contributed with individual work as described below:
Individually, we each created a unique EDA document composed of our own insights gained through extensive research and time with the data sets.
Individually, we created a cleaning script adhering to the defined plan as described above. This approach helped ensure all gained valuable programming experience and were intimately familiar with our cleaned data.
Thank you,
Adam, Georgia, Tyler, Zac